Decision tree learning

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

https://en.wikipedia.org/wiki/Decision_tree_learning
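
As a quick illustration of the two flavours, the hedged sketch below fits scikit-learn's two tree estimators on tiny made-up arrays: the classifier's leaves hold class labels, while the regressor's leaves hold real values.

In [ ]:
# A minimal sketch contrasting classification and regression trees
# (the toy arrays below are made up purely for illustration)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0.0], [1.0], [2.0], [3.0]]  # one feature, four samples

clf = DecisionTreeClassifier().fit(X, ["healthy", "healthy", "sick", "sick"])
print(clf.predict([[2.5]]))  # leaves carry class labels -> ['sick']

reg = DecisionTreeRegressor().fit(X, [1.0, 1.2, 3.4, 3.6])
print(reg.predict([[2.5]]))  # leaves carry real values -> [3.4]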


In [1]:
# Decision Tree Regression
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

In [2]:
# Load the diabetes dataset
dataset = datasets.load_diabetes()

In [3]:
# Let us now build a pandas dataframe hosting the data at hand

# We first need the list of feature names for our columns
# BMI is the Body Mass Index
# ABP is the Average Blood Pressure
lfeat = ["Age", "Sex", "BMI", "ABP", "S1", "S2", "S3", "S4", "S5", "S6"]

In [4]:
# We now build the Dataframe, with the data as argument
# and the list of column names as keyword argument
df_diabetes = pd.DataFrame(dataset.data, columns=lfeat)

In [6]:
# Let's have a look at the first few entries
print "Printing data up to the 5th sample"
df_diabetes.iloc[:5,:] # Look at the first 5 samples for all features.


Printing data up to the 5th sample
Out[6]:
Age Sex BMI ABP S1 S2 S3 S4 S5 S6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641

In [ ]:
# We also want to add the regression target
# Let's create a new column:
df_diabetes["Target"] = dataset.target # Must have the correct size of course

In [ ]:
# Let's review our complete dataframe:
print()
print("Printing data up to the 5th sample")
print("Also print the target")
print(df_diabetes.iloc[:5, :])  # Look at the first 5 samples for all features, including the target

In [ ]:
# We are now going to fit a regression tree model to the data

# Essentially, regression trees create a partition of the feature space
# In 2D this can easily be visualised as splitting a rectangle into
# a set of non-overlapping sub-rectangles
# Once the partition has been created, a simple model is fit in each region:
# generally, a simple constant (the mean of the target in that region)

# Trees are easy to interpret; think of them as piecewise-constant approximations to the regression target

# Trees are built through a recursive binary partition of the feature space
# A simple criterion (the squared error left after a split) can be used to find out
# on which feature and at what value of that feature we make a partition
# (see the sketch in the cell just after the model is created)

# Tree parameters
# For instance, how deep the tree should be (that is, how many partitions are made)
# can be determined empirically on validation data to make sure the tree generalises well

# As before, we create an instance of the model
model = DecisionTreeRegressor()
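
To make the splitting criterion concrete, here is a didactic sketch of how a single candidate split can be scored: each side of the split is predicted by its own mean (the "simple constant"), and the score is the squared error that remains. This illustrates the idea only; it is not scikit-learn's actual implementation, and the helper names are made up.

In [ ]:
# Sketch of the squared-error split criterion (illustrative, not sklearn's code)

def split_score(x, y, threshold):
    # Squared error left after splitting feature x at `threshold`,
    # predicting each side by its own mean
    left, right = y[x <= threshold], y[x > threshold]
    return ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()

def best_split(x, y):
    # Candidate thresholds: midpoints between consecutive distinct values,
    # so both sides of every candidate split are non-empty
    xs = np.sort(np.unique(x))
    candidates = (xs[:-1] + xs[1:]) / 2.0
    scores = [split_score(x, y, t) for t in candidates]
    return candidates[np.argmin(scores)]

# Best single split on BMI for our diabetes target;
# the tree builder applies this search to every feature, then recurses on each half
print(best_split(df_diabetes["BMI"].values, df_diabetes["Target"].values))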

In [ ]:
# Which we then fit to the training data X, Y
# With pandas we have to split the df in two:
# the feature part (X) and the target part (Y)
# This is done below:

data = df_diabetes[lfeat].values
target = df_diabetes["Target"].values

model.fit(data, target)
print(model)

In [ ]:
# Summarize the fit of the model
# We can estimate the performance of the fit using the MSE metric
# In this case, we simply compute the mean of the squared error on the sample
# the lower, the better
expected = target
predicted = model.predict(data)
mse = np.mean((predicted - expected)**2)
print("Mean squared error: %.2f" % mse)
# Coefficient of determination R^2: 1 is perfect prediction
print('R^2 score: %.2f' % model.score(data, target))

In [ ]:
# This time we can see we got a perfect prediction on the training data.
# An unconstrained tree is grown until its leaves are pure, so it memorises
# the training sample; this says nothing about how well it generalises.
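
Since the unconstrained tree simply memorised the training sample, depth should be chosen on held-out data, as the comments above suggested. A minimal sketch: the 80/20 split and the candidate depths are arbitrary choices for illustration, and it assumes scikit-learn >= 0.18, where train_test_split lives in sklearn.model_selection.

In [ ]:
# Pick max_depth on a held-out validation set (illustrative settings)
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    data, target, test_size=0.2, random_state=0)

for depth in [2, 3, 5, 10, None]:  # None grows the tree until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    print(depth, tree.score(X_val, y_val))  # validation R^2, higher is better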

Business formulation - continued

We will see that this data can be used to build expertise in diagnosing diabetes using trees, which are a very natural way of solving the problem.

Indeed, a medical examination generally works as a series of questions (from the doctor) and answers (from the patient). After enough questions, the doctor has a good idea of the patient's status. Likewise, you can picture a tree as a function trying to isolate groups of patients with similar characteristics. So when a new patient comes in, the tree asks the same set of questions to find out to which group the patient belongs; the tree then states that the patient should be in approximately the same state as the other patients of this group.

This is how decision trees work: using training data, the tree learns to ask the best set of questions to assign a disease status to a patient. When a new patient comes in, their disease status is inferred from their answers.
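
To see the "questions" a fitted tree actually asks, recent scikit-learn versions (0.21 and later) can print the tree as nested if/else rules with sklearn.tree.export_text. A short sketch, refitting a shallow tree so the printout stays readable:

In [ ]:
# Print the questions the tree asks (requires scikit-learn >= 0.21)
from sklearn.tree import export_text

shallow = DecisionTreeRegressor(max_depth=2).fit(data, target)
print(export_text(shallow, feature_names=lfeat))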

